ABSTRACT

Two models were identified for criterion-referenced
tests, one based on the assumption of a continuous achievement
variable and the other assuming a dichotomous or binary variable.
Several test characteristics were examined and contrasted for the two
models, including the distribution of scores, establishment of a
cutting score, test length, item difficulty, and reporting of test
information. In addition, the appropriateness of each model for
measuring learning tasks involving verbal information or intellectual
skills was discussed. (Author)

University of Texas Health Science Center at Dallas
Florida State University

Since introduction of the term "criterion-referenced" by Glaser (1963), 
a wide variety of definitions and interpretations of the term, as well as 
alternative terms for similar concepts, have appeared in the literature. Many
controversies have arisen over various characteristics of criterion-referenced 
tests. Much of the disagreement can be traced to differences in underlying 
assumptions, often unstated, about the nature of the achievement variable being 
measured. Once these assumptions are made public, it often becomes evident that
opposing proponents are discussing different situations and that both may be 
correct. Many of the differences concerning the nature and use of criterion- 
referenced tests can be abated by considering more than one type of achievement 
variable. 

It is the contention of this paper that assumptions concerning the continuous 
or dichotomous nature of an achievement variable substantially affect the charac- 
teristics and use of a criterion-referenced test developed to measure the variable. 
It is further contended that different assumptions may be desirable for measure- 
ment of different domains of learning outcomes (Gagne, 1971). In particular, the 
assumption of continuity may be most appropriate in measuring verbal information 
outcomes whereas the assumption of dichotomy may be most appropriate in measuring 
outcomes described as intellectual skills. 



The Nature of Achievement Variables 
Problems have arisen in criterion-referenced measurement because of variation
in the manner that different types of achievement can be demonstrated. As
noted by Popham & Husek (1969), "Some criterion-referenced tests yield scores
which are essentially 'on-off' in nature, that is, the individual has either
mastered the criterion or he hasn't . . . more commonly, however, a range of
acceptable performance exists [p. 7]." Unfortunately, these differences in
types of observable performance are often ignored and tests of both types of
performance are treated similarly.

Most criterion-referenced test users assume that achievement is distrib-
uted as a continuous variable and that all levels of proficiency relative to an
objective can exist. This assumption was first expressed by Glaser (1963) in
his discussion of a "continuum of knowledge acquisition ranging from no pro-
ficiency at all to perfect performance [p. 519]."

A few users of criterion-referenced tests consider achievement as a binary 
variable and assume that all examinees are either masters or nonmasters of a 
specified objective. For example, Emrick (1971) stated, "mastery of each unitary 
skill is assumed to be an all or none variable [p. 322]." Regardless of whether 
achievement is considered as a binary or continuous variable, nearly all test 
users attempt to dichotomize scores to provide mastery and nonmastery classifi- 
cations of examinees. 

It appears that most developers of criterion-referenced tests consider all 
types of human performance to be similar. Gagne (1974) suggested, however, that 
five different classes of performance are readily distinguishable from each 
other. If Gagne is correct, it may be appropriate to employ different measure-
ment models with different types of learning outcomes. The following discussion
focuses upon two of Gagne's domains, verbal information and intellectual skills.

According to Gagne and Briggs (1974) the verbal information domain en-
compasses the learning of labels, single facts, and organized information or
knowledge. One might argue that single units of verbal information such as
labels or single facts are recalled in an all or none manner. Even if this
is true, the measurement of single units of information is probably a trivial
operation in most instances. Seldom is a single unit of information considered
of sufficient importance to be tested separately. More commonly, a collection
of information, preferably interrelated to comprise a body of organized know-
ledge, is tested simultaneously. A collection of information forms a content
domain from which items are randomly sampled. Performance of an examinee
relative to the entire domain depends upon the number of discrete units of
information that have been acquired and remembered. If it is assumed that
achievement of each of the discrete units of information is demonstrated inde-
pendently, any proficiency from 0-100% might be demonstrated on a test. Thus,
achievement of verbal information measured by a domain-referenced test would be
demonstrated as a continuous variable.

A stronger case can be established for the measurement of single intellec-
tual skills than for the measurement of single units of verbal information.
While a single verbal proposition represents only one behavior, a single intel- 
lectual skill encompasses an entire class of behaviors. If the research on 
learning hierarchies is valid, the intellectual skill may constitute a pre- 
requisite for a number of other skills, whereas the verbal information may have 
limited utility for other learning. In addition, the measurement of a collection 
of intellectual skills may present serious scaling problems. If hierarchical 
dependencies exist, combining scores from different levels of the hierarchy may 
be analogous to adding feet and inches. 

Since an intellectual skill defines an entire class of behaviors, a large
number of parallel items could be generated to measure a single skill. Theo-
retically, a learner who acquires the intellectual skill would be able to
demonstrate the entire class of behaviors while the learner who has not
acquired the skill would be unable to perform any of the behaviors. Accord-
ingly, achievement of an intellectual skill would be demonstrated as a binary
variable.

In a previous paper, the author (Graham, 1974) adopted the terms competency 
test and proficiency test to differentiate between tests constructed to measure 
the two different types of achievement variables. The term competency test 
was used to describe a criterion-referenced test of achievement that is demon- 
strated as a binary variable, while the term proficiency test was reserved for 
a criterion-referenced instrument constructed to measure a learning variable 
which can be achieved to any degree. It seems appropriate to consider a continuum 
of proficiencies but only two states of competency, mastery and nonmastery. This 
restricted usage of the terms competency test and proficiency test is followed 
in the remainder of the present paper. 

Basic Assumptions 

The discussion above presents a case for two different criterion-referenced 
measurement models for the assessment of learning outcomes. A binary model 
would be necessary for the measurement of intellectual skills, while a continuous 
model would be more appropriate for assessing achievement of verbal information. 
Let us look briefly at the assumptions and corollaries of these two models. 

Binary Model 

The critical assumption in the binary model is that certain capabilities 
enable an individual to perform an entire class of behaviors, and if the capa- 
bility is not acquired, the individual cannot perform any of the class of 
behaviors. Since a series of items sampled from a domain representing the class 
of behaviors are measuring the same learned capability, responses to the items
are expected to be highly intercorrelated. Theoretically, true scores for
individuals relative to the item domain representing the class of behaviors
will be either zero or 100%. Deviations from these all or nothing scores
are caused by measurement error and do not accurately reflect the true capa-
bility of the individual.

It was previously stated that achievement in the domain of human per-
formance referred to as intellectual skills appears to provide an appropriate
situation for application of the binary model. In the study noted earlier
(Graham, 1974) the author employed a strict item-sampling model to generate
domain-referenced tests of intellectual skills. The tests displayed the
characteristics expected for a binary achievement variable. Horwitz (1974)
and Bergquist and Horwitz (1975) also demonstrated the viability of the binary
model with tests constructed to measure unitary, explicitly defined intellectual
skills. Performance on tests constructed in these studies was essentially all
or none, resulting in high interitem correlations.

It might be useful to examine an example of a competency test of the intel-
lectual skill domain. In developing a test to measure the skill of adding
negative integers, Bergquist and Horwitz (1975) randomly selected 10 items from
the total domain of addition problems comprised of two negative integers. The
test was administered to 67 eighth grade students. More than 94% of the
examinees scored outside the range 2-6 with approximately one-fourth of the
students failing all items and approximately one-half of the students receiving
perfect scores. It seems reasonable that scores falling in the middle of the
range actually represented measurement error resulting from such factors as
carelessness, fatigue, guessing, and cheating and that the students were either
capable or not capable of adding two negative one-digit integers.

Continuous Model

In many situations, achievement is expected to be demonstrated as a
continuous variable. The model is based on the major assumption that certain
learned capabilities exist for which only a single behavior can be demon-
strated, and unless that capability is of major importance, it should be
measured as part of a collection of behaviors comprising a larger domain. It
is further assumed that performance of one capability is independent of per-
formance of the other capabilities in the collection. Relative to a domain of
independent capabilities the true score of an individual is determined by the
number of individual capabilities that have been acquired and may assume any
value from zero to 100% of the capabilities comprising the collection.

Domain-referenced tests of verbal information would warrant consideration
of this model. Most educators have considerable familiarity with tests of
verbal information. Even tests intended to measure achievement of intellectual
skills are often constructed in such a manner that it is possible to provide
correct responses through recall of related verbal information without actually
demonstrating the skill of interest.

A typical example of a verbal information test can be drawn from the
Physician's Assistant Program with which the author is associated. Trainees in
the program are expected to learn 214 common medical abbreviations. It is
possible for individual students to learn any number of abbreviations from the
total collection and thus possess any true proficiency relative to the total 
collection. The proportion of correct responses provided by a student on a 
random sample of items from the domain of medical abbreviations would provide 
an unbiased estimate of the examinee's true proficiency with respect to the 
domain. 
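
The sampling argument above can be illustrated with a minimal simulation sketch. The figure of 214 abbreviations comes from the text; the student who knows 150 of them, the 20-item test length, and the function name `estimate_proficiency` are assumptions introduced only for illustration.

```python
import random


def estimate_proficiency(known, domain_size, test_length, rng):
    """Proportion correct on a random sample of items drawn from the domain.

    `known` is the set of item indices a hypothetical student has learned.
    """
    items = rng.sample(range(domain_size), test_length)
    return sum(1 for item in items if item in known) / test_length


rng = random.Random(0)
DOMAIN_SIZE = 214            # abbreviations in the domain (from the text)
known = set(range(150))      # assumption: this student knows 150 of them
true_proficiency = len(known) / DOMAIN_SIZE

# Administer many hypothetical 20-item random forms to the same student.
estimates = [estimate_proficiency(known, DOMAIN_SIZE, 20, rng) for _ in range(5000)]
mean_estimate = sum(estimates) / len(estimates)

print(f"true proficiency:        {true_proficiency:.3f}")
print(f"mean of sample estimates: {mean_estimate:.3f}")
```

Averaged over many random forms, the proportion correct should approach the student's true proficiency, although any single 20-item form may miss it by several percentage points.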

Implications

Many of the controversial issues concerning criterion-referenced measurement
are given new perspective when considered in the context of alternative achieve-
ment variable models. In the discussion that follows, several characteristics
of criterion-referenced tests are examined in relation to binary and continuous
achievement variables.

Score Distributions 

In the previous section, it was indicated that under the assumption of
achievement as a binary variable, only two performance capabilities, mastery 
and nonmastery, are expected. Theoretically, true scores for all members of 
the mastery population are 100% while true scores for all nonmasters are zero. 
Deviation of observed scores from these two levels is attributed to measurement 
error. When such a test is administered to a group comprised of both masters 
and nonmasters, the scores would be expected to be distributed bimodally. In 
the studies by Graham (1974), Horwitz (1974), and Bergquist and Horwitz (1975), 
quite pronounced bimodal characteristics were obtained for score distributions 
on the tests of intellectual skills. In addition, Graham obtained a trimodal
score distribution for a test constructed to measure two intellectual skills 
simultaneously. 
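
The expected bimodality can be illustrated with a small simulation sketch. It is not a reanalysis of the studies cited; the proportion of masters and the two error rates (`slip`, a master's chance of missing an item, and `guess`, a nonmaster's chance of answering correctly) are assumed values chosen only to show how all-or-none true scores plus measurement error pile observed scores at the two ends of the scale.

```python
import random
from collections import Counter


def simulate_competency_scores(n_examinees, n_items, p_master, slip, guess, rng):
    """Observed scores when true scores are all-or-none and errors are random.

    slip and guess are hypothetical error rates, not estimates from any study.
    """
    scores = []
    for _ in range(n_examinees):
        is_master = rng.random() < p_master
        p_correct = 1.0 - slip if is_master else guess
        scores.append(sum(rng.random() < p_correct for _ in range(n_items)))
    return scores


rng = random.Random(1)
scores = simulate_competency_scores(1000, 10, 0.6, slip=0.05, guess=0.05, rng=rng)
histogram = Counter(scores)

# Text histogram: one '#' per 10 examinees at each raw score.
for score in range(11):
    print(f"{score:2d} {'#' * (histogram[score] // 10)}")
```

Under these assumptions nearly all examinees land within a point or two of zero or of a perfect score, and the middle of the range is almost empty.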

The anticipated distribution of scores on proficiency tests would be quite 
different than for competency tests. Since all true proficiencies would theo- 
retically exist, the score distribution on a given test administration would be 
determined by the level of attainment of the sample tested. With a large random 
sample of individuals in a traditional, time-based learning environment, scores 
would likely be normally distributed. On the other hand, pretest and posttest 
scores in a mastery learning situation would no doubt be highly skewed. A
bimodal score distribution for a single administration of a proficiency test,
however, would be highly unusual.

A major point of controversy in discussions of criterion-referenced test 
characteristics has been the issue of score variance. The introduction of two
achievement variable models does not directly address this issue. For compe- 
tency tests, however, considerable score variability would exist except in the 
special case when only masters or nonmasters are included in the test sample. 

Test Homogeneity 

One of the most important implications of a dual concept for achievement 
variables concerns the homogeneity of an item set. Some advocates of criterion-
referenced measurement believe that a test of a single behavioral objective
should be homogeneous in form, content, and difficulty while others argue that 
a highly homogeneous test measures an overly restricted item domain. This 
controversy should be examined in reference to the alternative measurement models. 

An information objective can be stated to describe a single behavior or a 
collection of behaviors. In most instances, the measurement of a single behavior 
is probably a trivial or at least an inefficient operation. It is usually advan-
tageous to define a collection or domain of similar behaviors and to draw
inferences about capabilities relative to the entire domain through item-sampling 
procedures. In this situation, item homogeneity would depend upon the similarity 
of the behaviors comprising the domain. To the extent that increasing the size 
of the domain would tend to exhaust the supply of similar behaviors, item homo- 
geneity would be dependent upon domain size. Since performance on one item is 
assumed to be independent of performance on another, items would be expected 
to display a range of difficulty values. Thus, a test of a domain of verbal 
information would not necessarily be homogeneous in content or difficulty.

On the other hand, competency test items are not independent of each other.
Since an intellectual skill domain defines a class of behaviors, each item
provides a repeated measure of the same skill or behavior. Consequently, item
homogeneity would be a necessary characteristic of a good competency test.
Deviations from a high degree of item homogeneity indicate confounding of
measurement with other skills or verbal information. A test that simultaneously
measures more than one class of performance would not possess the characteristic
described by Gagne (1968) as distinctiveness.

It is seldom possible to construct a test, or even a single item, which is
so distinctive that it measures only one intellectual skill. The measurement of
intellectual skills is always confounded with the simultaneous measurement of
other capabilities. If all members of the test population have mastered the 
extraneous capabilities, however, the confounding does not interfere with measure- 
ment of the specific skill defined by an objective, and the test possesses the 
quality of distinctiveness. 

There is at least one situation in which differential capabilities of
examinees to perform supplementary skills cannot be detected. This situation
exists whenever the supplementary skill or skills are uniformly required for all
items in a test. An example is the need for prerequisite reading skills for
solution of any verbally stated mathematics problem. In such situations, the
supplementary skills should either be specified as part of the objective or should
be measured independently in a separate pretest to ascertain their influence upon
misclassification of certain examinees. Intensive investigation into the effects 
of measurement confounding and into appropriate means of handling this problem 
appears warranted. 

Item difficulty values for a competency test are actually average difficulty 
values that depend upon the composition of the test sample. For such a test,
item difficulty is actually a function of the learning state of the examinee.
Hypothetically, the difficulty values for the mastery and nonmastery populations
should be one and zero respectively. Thus, whenever an examination sample is 
comprised of both masters and nonmasters of an intellectual skill, the magnitude 
of the difficulty value for an item depends upon the relative representation of 
the two competency populations in the test administration sample. 
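
The dependence of item difficulty on sample composition can be written as a weighted average. The sketch below is an illustration under the binary model's idealized values; the function name and the optional error-adjusted probabilities are assumptions, not part of the original analysis.

```python
def expected_item_difficulty(proportion_masters,
                             p_correct_master=1.0,
                             p_correct_nonmaster=0.0):
    """Expected proportion of a mixed sample answering an item correctly.

    The defaults are the hypothetical values of one and zero for masters and
    nonmasters; other values may be supplied to allow for measurement error.
    """
    return (proportion_masters * p_correct_master
            + (1.0 - proportion_masters) * p_correct_nonmaster)


# With error-free responding, difficulty equals the proportion of masters.
for share in (0.25, 0.50, 0.75):
    print(share, expected_item_difficulty(share))
```

With the idealized values, the difficulty index simply reproduces the proportion of masters in the sample, which is why it describes the group tested rather than the item.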

In a discussion of reliability, Stanley (1971) demonstrated that the only 
time dichotomously scored items can be perfectly intercorrelated, resulting in 
the maximum value of one for KR-20, is when all items have equal difficulty. The
author (Graham, 1974) repeatedly obtained KR-20 estimates of reliability well
above 0.9 for 10-item tests of intellectual skills for which items were randomly
generated. In the study, item-test correlation coefficients above 0.7 were 
the rule rather than the exception. Instances in which single items deviated 
in difficulty value from other items of a test could be explained by differences
in the supplementary capabilities required for correct responses to the items.
This investigation provided strong evidence for the binary nature of intellectual
skill achievement. 
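
For readers who wish to check the behavior of KR-20 under the two models, the following sketch computes the coefficient from a matrix of dichotomous item scores using the standard Kuder-Richardson formula 20 with the population variance of total scores. The two data sets are constructed for illustration only (an error-free all-or-none pattern and independently answered items) and are not data from the studies cited.

```python
import random


def kr20(responses):
    """KR-20 for a list of examinee rows, each a list of 0/1 item scores."""
    n = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n  # population variance
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n  # item difficulty
        sum_pq += p * (1.0 - p)
    return (k / (k - 1)) * (1.0 - sum_pq / var_total)


# Hypothetical all-or-none responding: 50 masters and 50 nonmasters, no error.
all_or_none = [[1] * 10 for _ in range(50)] + [[0] * 10 for _ in range(50)]

# Hypothetical independent items, each answered correctly with probability 0.5.
rng = random.Random(2)
independent = [[int(rng.random() < 0.5) for _ in range(10)] for _ in range(200)]

print(round(kr20(all_or_none), 3))
print(round(kr20(independent), 3))
```

The all-or-none pattern, in which every item has the same difficulty and items are perfectly intercorrelated, yields the maximum value of one, while independently answered items yield a coefficient near zero.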

Passing Scores 

For a variety of reasons, educators often wish to establish a minimum 
standard of acceptable performance on a domain-referenced achievement test. 
Kriewall (1969) suggested that such standards should be formulated as part of
the design specifications outlined during curriculum development. At the present
time, Hambleton and Novick (1973) believe that "the establishment of proficiency
levels is primarily a value judgement [p. 163]." To assist in this judgement,
Millman (1973) discussed five factors that should be considered in the deter-
mination of performance standards. Once a performance standard has been
established, it must then be translated into a passing score for a given
sample of items from the domain. Factors other than the performance standard
that influence passing scores are test length and the relative seriousness of
the two types of classification error.

For situations in which achievement is demonstrated as a binary variable,
there is no need for establishing a performance standard. Since only two
performance capabilities are assumed to exist, it is unnecessary to operationally
define the mastery state. A passing score is established at a level that tends
to minimize the number of examinees that are misclassified due to measurement
error. Figure 1 presents the frequencies of scores obtained on a 10-item test
administered by the author (Graham, 1974). With bimodal score distributions of
this type, it is most convenient to establish a passing score simply by inspection.
Since less than 15% of the examinees received scores in the range 1-8, selection
of any score within this range as a passing score would not substantially alter
the classification results. The nature of the consequences of misclassifying
true masters or true nonmasters would influence the selection of a specific
passing score within this range.

Figure 1. Distribution of scores obtained on a domain-referenced
test of an intellectual skill (Graham, 1974).
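
Setting a passing score by inspection can be approximated mechanically, as in the sketch below. The frequencies are invented for illustration and are not the data of Figure 1; the rule of placing the passing score at the least frequent score between the two modes is one possible reading of "minimal overlap," offered as an assumption rather than as the procedure actually used.

```python
def passing_score_by_inspection(frequencies):
    """Least frequent raw score lying between the lower and upper modes.

    frequencies[s] is the number of examinees with raw score s. Examinees
    scoring at or above the returned value would be classified as masters.
    """
    n_scores = len(frequencies)
    mid = n_scores // 2
    low_mode = max(range(0, mid + 1), key=lambda s: frequencies[s])
    high_mode = max(range(mid, n_scores), key=lambda s: frequencies[s])
    between = range(low_mode + 1, high_mode)
    if len(between) == 0:
        raise ValueError("no interior scores between the two modes")
    return min(between, key=lambda s: frequencies[s])


def fraction_reclassified(frequencies, cut_a, cut_b):
    """Share of examinees whose classification differs between two passing scores."""
    low, high = sorted((cut_a, cut_b))
    return sum(frequencies[low:high]) / sum(frequencies)


# Hypothetical frequencies for raw scores 0 through 10 on a 10-item test.
frequencies = [120, 30, 8, 3, 2, 1, 2, 4, 10, 60, 260]

print(passing_score_by_inspection(frequencies))
print(fraction_reclassified(frequencies, 3, 8))
```

When the distribution is strongly bimodal, moving the passing score across the sparse middle of the range changes the classification of only a small fraction of examinees, so the final choice can rest on the relative seriousness of the two kinds of error.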

Test Length 

The determination of test length is closely related to passing score and 
errors of classification. For situations in which a continuum of proficiencies 
is assumed to exist, Kriewall (1969) employed acceptance sampling procedures 
based upon the classical binomial model for establishing test length. Millman 
(1972) used the binomial model to construct tables that relate test length to 
classification accuracy for various passing scores. By assuming prior informa- 
tion about an examinee's level of functioning, Novick and Lewis (1974) introduced 
a more precise method of determining test length based upon a Bayesian model. 
These procedures appear useful for determining the number of items required for 
a domain-referenced proficiency test. 
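
The logic of the binomial approach can be sketched as follows. This is a simplified illustration in the spirit of the binomial model, not a reproduction of Kriewall's acceptance sampling plans or Millman's tables; the two reference proficiency levels (`master_level` and `nonmaster_level`) and the tolerated error rate are assumed values that a test developer would have to supply.

```python
from math import comb


def prob_pass(n_items, passing_score, proficiency):
    """Binomial probability of answering at least passing_score items correctly."""
    return sum(
        comb(n_items, k) * proficiency**k * (1 - proficiency) ** (n_items - k)
        for k in range(passing_score, n_items + 1)
    )


def shortest_test(passing_percent, master_level, nonmaster_level, max_error,
                  max_items=200):
    """Shortest test keeping both classification error rates within max_error.

    passing_percent is an integer performance standard (e.g., 80 for 80%).
    master_level and nonmaster_level are assumed true proficiencies of an
    examinee who clearly should pass and one who clearly should fail.
    """
    for n_items in range(1, max_items + 1):
        passing_score = -(-passing_percent * n_items // 100)  # integer ceiling
        false_fail = 1 - prob_pass(n_items, passing_score, master_level)
        false_pass = prob_pass(n_items, passing_score, nonmaster_level)
        if false_fail <= max_error and false_pass <= max_error:
            return n_items, passing_score
    return None


print(round(prob_pass(10, 8, 0.9), 4))
print(shortest_test(80, 0.95, 0.60, 0.10))
```

The sketch makes visible why test length, passing score, and tolerable error must be decided together: tightening the error rate or narrowing the gap between the two reference levels increases the number of items required.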

The binomial and Bayesian procedures are appropriate for proficiency tests 
because they make no assumptions about the homogeneity of items. For a compe-
tency test, however, the number of items required to provide reliable mastery
classifications of examinees is closely related to item homogeneity. Without a 
homogeneous item set, the bimodal characteristics of the score distribution would 
not be pronounced. Unless the distinctiveness of a measure can be increased to 
produce a more homogeneous set of items, a greater number of items will be
necessary to minimize the amount of classification error. 

Figure 1 indicates that ten items were more than enough for classifying 
examinees on the behavior of interest. For tests of unitary intellectual skills,
in which there is little confounding with subordinate skills or related information,
three to five items are probably sufficient for providing reasonably reliable
classifications. If this is true, considerably fewer items are necessary for
measuring a homogeneous class of behaviors than are necessary for measuring a
collection of behaviors such as verbal information.

Reporting Results 

The final consideration involving the relationship between criterion-
referenced test characteristics and assumptions about the nature of achievement
variables concerns the reporting of test results. Sensible reporting of the
results on a competency test should probably be binary (e.g., master-nonmaster
or pass-fail). The purpose of the test is to determine in which category the
student actually belongs. Deviations from these categories are assumed to be 
attributable only to error and need not be included in the score reporting. 

Scores from a test of an achievement continuum are expected to reflect the 
underlying range of capabilities. These scores are more meaningfully expressed 
as a percentage passed or a proficiency level. Even if the information from a 
proficiency test is used to divide the group into mastery and nonmastery classi-
fications through an established passing score, it appears unjustifiable not to
inform each student of the obtained estimate of his true level of proficiency.



Summary 

It was suggested that different measurement models may be required for 
assessing different types of learning outcomes. In particular, intellectual 
skills apparently encompass classes of behaviors that are demonstrated in an all 
or none manner, while a collection of verbal information can be achieved to 
varying degrees. If this is true, mastery seems more relevant to skill learning
and proficiency is a more important concept for information. By considering
alternative measurement models for these two situations, a new perspective is
provided for viewing the contradictions and controversies related to criterion-
referenced measurement theory. Table 1 summarizes some of the implications
that alternative achievement variable models may have for different character- 
istics of criterion-referenced tests. 

Table 1

Relation of Criterion-Referenced Test Characteristics
to Assumptions about the Nature of Achievement Variables

Criterion-Referenced    Achievement Variable Model
Test Characteristics    Binary                         Continuous
--------------------    ---------------------------    -------------------------------
Name                    Competency Test                Proficiency Test

Application             Intellectual Skills            Verbal Information

Type of Performance     Class of Behaviors             Collection of Behaviors

Score Distribution      Bimodal                        Variable (Depends on item
                                                       domain and test administration
                                                       sample.)

Test Homogeneity        Desirable (Characteristic      Unnecessary (Often indicates
                        of a good test.)               an overly-restricted item
                                                       domain.)

Passing Score           Established by determining     Established to maintain
                        point of minimal overlap of    performance standard (judgment)
                        distribution.                  in conjunction with test length
                                                       and error probability (Binomial
                                                       or Bayesian methods).

Test Length             Determined by homogeneity      Determined by passing score and
                        and importance of correct      error probability (Binomial or
                        classification.                Bayesian methods).

Reporting Results       Dichotomous (Pass-Fail or      Proficiency estimate
                        Mastery-Nonmastery)

Many domain-referenced tests have been constructed, either intentionally
or unintentionally, to measure collections of several intellectual skills and
a variety of verbal information simultaneously. In such situations it is
virtually impossible to draw inferences concerning what the examinee can and
cannot do. If items are randomly sampled from a domain of clearly defined
verbal information, it is possible to infer the examinee's capability or pro-
ficiency relative to the entire domain. Likewise, measurement of a unitary
intellectual skill permits conclusions concerning whether or not the skill has
been mastered. Combining of skills with other skills or verbal information
results in confounding of measurement that makes any conclusion tenuous.